Current Issue: January–March | Volume: 2018 | Issue Number: 1 | Articles: 5
Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, current methods do not consider the relationships between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm uses the pleasure-arousal-dominance emotion scale to calculate the weights between emotions and certain mathematical derivations to determine the network structure. The proposed algorithm can handle a large number of concepts, whereas a typical FCM can handle only relatively simple networks (maps). Different acoustic features, including fundamental speech features and a new spectral feature, are extracted to evaluate the performance of the proposed method. Three experiments are conducted in this paper: a single-feature experiment, a feature-combination experiment, and a comparison between the proposed algorithm and typical networks. All experiments are performed on the TYUT2.0 and EMO-DB databases. Results of the feature-combination experiments show that the recognition rates of the combined features are 10%–20% better than those of single features. The proposed FCM learning algorithm yields a 5%–20% performance improvement over traditional classification networks…
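The abstract names the core mechanism: concept activations propagate through weights derived from the pleasure-arousal-dominance (PAD) scale until the map settles. Below is a minimal Python sketch of a generic FCM update of that kind; the PAD coordinates and the distance-based weight rule are illustrative assumptions, not the paper's actual derivation.

```python
import numpy as np

# Hypothetical PAD coordinates for four emotions; the values are
# illustrative placeholders, not the paper's scale.
PAD = {
    "anger":   np.array([-0.51,  0.59,  0.25]),
    "joy":     np.array([ 0.81,  0.51,  0.46]),
    "sadness": np.array([-0.63, -0.27, -0.33]),
    "neutral": np.array([ 0.00,  0.00,  0.00]),
}

emotions = list(PAD)
n = len(emotions)

# One plausible way to turn PAD distances into inter-emotion weights:
# emotions that are closer in PAD space reinforce each other more
# strongly (an assumption, not the paper's exact derivation).
W = np.zeros((n, n))
for i, ei in enumerate(emotions):
    for j, ej in enumerate(emotions):
        if i != j:
            W[i, j] = 1.0 / (1.0 + np.linalg.norm(PAD[ei] - PAD[ej]))

def fcm_step(state, W, lam=1.0):
    """Standard FCM update: squash the weighted sum of concept activations."""
    return 1.0 / (1.0 + np.exp(-lam * (W.T @ state)))

# Start from acoustic-feature evidence (e.g., per-emotion classifier scores)
state = np.array([0.7, 0.1, 0.1, 0.1])
for _ in range(10):            # iterate until the map settles
    state = fcm_step(state, W)
print(dict(zip(emotions, state.round(3))))
```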
This paper discusses the results of a pilot experimental study of speech recognition and perception of the semantic content of utterances in noisy environments. The experiment included perceptual-auditory analysis of words and phrases in Russian and German (in comparison) under the same noisy conditions: various types of noise (pink and white) at various signal-to-noise ratios. The statistical analysis showed that intelligibility and perception of speech in a noisy environment are influenced not only by the noise type and signal-to-noise ratio, but also by linguistic and extralinguistic factors, such as the redundancy of a particular language at various levels of linguistic structure, changes in the acoustic characteristics of the speaker when switching from one language to another, the speaker's and listener's proficiency in a specific language, and the acoustic characteristics of the speaker's voice…
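For readers who want to reproduce this kind of stimulus design, the sketch below shows one conventional way to mix noise into speech at a target signal-to-noise ratio and to generate an approximate pink-noise masker; the sample rate, placeholder signal, and 1/f spectral shaping are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to the speech (float arrays of equal length)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

def pink_noise(n, rng):
    """White noise is spectrally flat; pink noise falls off ~3 dB/octave.
    Shaping white noise by ~1/sqrt(f) in amplitude approximates it."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    spectrum /= np.sqrt(np.arange(1, spectrum.size + 1))
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder for a real recording
stimulus = mix_at_snr(speech, pink_noise(16000, rng), snr_db=0)
```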
Audio fingerprinting has been an active research field typically used for music identification. Robust audio fingerprinting technology is used to perform content-based audio identification regardless of the audio signal being subjected to various types of distortion. These distortions affect the time-frequency correlation related to pitch and speed changes. In this paper, experiments are conducted using the computer vision technique ORB (Oriented FAST and Rotated BRIEF) for robust audio identification. Investigations are conducted into ORB's robustness against distortions, including speed and pitch changes. The ORB prototype compares the features of the query spectrogram image to a database of spectrogram images of the songs. For the initial experiment, a Brute-Force matcher is used to compare the ORB descriptors. Results show that the ORB prototype is robust to real-world distortions such as speed and pitch changes, with fast and reliable performance…
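The pipeline described here (spectrogram image, ORB keypoints, Brute-Force Hamming matching) maps directly onto standard OpenCV calls. The following sketch assumes SciPy and OpenCV, a 16 kHz sample rate, and random stand-in signals; the actual prototype's spectrogram parameters and decision rule are not specified in the abstract.

```python
import cv2
import numpy as np
from scipy.signal import spectrogram

def spectrogram_image(signal, fs=16000):
    """Log-magnitude spectrogram rescaled to an 8-bit grayscale image."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)
    log_sxx = 10 * np.log10(sxx + 1e-10)
    img = cv2.normalize(log_sxx, None, 0, 255, cv2.NORM_MINMAX)
    return img.astype(np.uint8)

def orb_match_score(query_img, ref_img):
    """Count Brute-Force Hamming matches between ORB descriptors of two
    spectrogram images; more matches suggests the same underlying song."""
    orb = cv2.ORB_create(nfeatures=1000)
    _, d1 = orb.detectAndCompute(query_img, None)
    _, d2 = orb.detectAndCompute(ref_img, None)
    if d1 is None or d2 is None:
        return 0
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return len(bf.match(d1, d2))

# Identify a query by the reference with the highest match count.
rng = np.random.default_rng(0)
db = {f"song{i}": spectrogram_image(rng.standard_normal(160000))
      for i in range(3)}
query = db["song1"]            # stand-in for a distorted excerpt
best = max(db, key=lambda name: orb_match_score(query, db[name]))
```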
Accurate emotion recognition from speech is important for applications such as smart health care, smart entertainment, and other smart services. High-accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, covering both speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel-frequency cepstral coefficients (MFCC), pitch, formants, short-term zero-crossing rate, and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features for identifying the emotion status of speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train the DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features reflect emotion status better than handcrafted features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well on small training databases if it is properly designed…
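As a rough illustration of the DBN-plus-SVM idea, the sketch below extracts the listed acoustic features with librosa and chains a single RBM layer into an SVM. Note the hedges: scikit-learn's BernoulliRBM is a one-layer stand-in for a full DBN and is trained by contrastive divergence, not the paper's conjugate-gradient method, and the feature dimensions and training data are placeholders.

```python
import numpy as np
import librosa
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def acoustic_features(y, sr):
    """Frame-averaged versions of the feature families the abstract lists
    (formant extraction omitted for brevity)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)       # pitch track
    zcr = librosa.feature.zero_crossing_rate(y).mean()   # short-term ZCR
    energy = librosa.feature.rms(y=y).mean()             # short-term energy
    return np.concatenate([mfcc, [np.nanmean(f0), zcr, energy]])

# RBM (single DBN-layer stand-in) learns features; the SVM classifies them.
model = Pipeline([
    ("scale", MinMaxScaler()),     # RBM expects inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20)),
    ("svm", SVC(kernel="rbf")),
])

# In practice each row of X would come from acoustic_features();
# random placeholders are used here so the sketch runs standalone.
rng = np.random.default_rng(0)
X, labels = rng.random((100, 16)), rng.integers(0, 4, 100)
model.fit(X, labels)
```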
An artificial neural network is an important model for training features in voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective at processing nonlinear features, such as Mel Cepstral Coefficients (MCC), which represent the spectral features. However, a simple representation of the fundamental frequency (F0) is not enough for NNs to handle emotional VC, because the time sequence of F0 for an emotional voice changes drastically. Therefore, in our previous method, we used the continuous wavelet transform (CWT) to decompose F0 into 30 discrete scales, each separated by one third of an octave, which can be trained by NNs for prosody modeling in emotional VC. In this study, we propose the arbitrary-scales CWT (AS-CWT) method to systematically capture F0 features at different temporal scales, which can represent different prosodic levels ranging from micro-prosody to the sentence level. Meanwhile, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that then convert the spectral features. By combining these approaches, the proposed method can change the spectrum and the F0 of an emotional voice simultaneously and outperforms other state-of-the-art methods in terms of emotional VC…
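The one-third-octave scale spacing is explicit in the abstract, so the CWT decomposition of an F0 contour can be sketched directly. In the sketch below, the Mexican-hat mother wavelet, the base scale, and the toy contour are assumptions; the abstract does not specify them.

```python
import numpy as np
import pywt

def decompose_f0(f0, n_scales=30, base_scale=2.0):
    """Decompose an interpolated log-F0 contour into `n_scales` CWT scales
    spaced one third of an octave apart (ratio 2**(1/3)), mirroring the
    multi-scale prosody representation the abstract describes."""
    scales = base_scale * 2.0 ** (np.arange(n_scales) / 3.0)
    coefs, _ = pywt.cwt(f0, scales, "mexh")  # Mexican-hat mother wavelet
    return coefs                              # shape: (n_scales, len(f0))

# Toy contour: a declining log-F0 with a periodic rise (placeholder data).
t = np.linspace(0, 1, 200)
f0 = np.log(120 + 30 * np.exp(-3 * t) + 10 * np.sin(2 * np.pi * 4 * t))
scales_repr = decompose_f0(f0)
```

Each row of the resulting matrix isolates F0 movement at one temporal scale, from fast micro-prosodic wiggles (small scales) to slow phrase- and sentence-level trends (large scales), which is what makes the representation trainable by NNs.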